Heterogeneity shapes groups growth in social online communities 
Many complex systems are characterized by broad distributions capturing, for example, the size of
firms, the population of cities or the degree distribution of complex networks. Typically this feature 
is explained by means of a preferential growth mechanism. Although heterogeneity is expected to 
play a role in the evolution, it is usually not considered in the modeling, probably due to a lack of
empirical evidence on how it is distributed. We characterize the intrinsic heterogeneity of groups in 
an online community and then show that together with a simple linear growth and an inhomogeneous 
birth rate it explains the broad distribution of group members. 

PACS numbers: 89.20.-a, 89.20.Hh, 89.75.Fb 



I. INTRODUCTION 

Many complex systems are characterized by
heavy-tailed distributions, e.g., Zipf's law (originally
used to describe the frequency of words [1]), Pareto's
law (originally describing the wealth of nations [2]),
and more recently scale-free topologies (capturing the
degree distribution of complex networks [3]) [4, 5]. This
property is typically perceived as a symptom of the
rich-gets-richer principle, and models implementing
some degree of preferential growth are usually the first
approach to explain heavy-tailed distributions [3, 5-16].
In line with the rich-gets-richer principle, Gibrat's
law suggests that the expected growth of a firm, a city
or social activity is proportional to its size [17-20].
However, in general, less attention has been devoted
to the time evolution of complex systems, probably
due to the lack of empirical data over time (for some
exceptions see [8, 21-23]). In many network growth
models the time unit is mapped to the number of newly
arriving elements, which makes it difficult to compare
the results with real data. Moreover, many models
assume that the elements are born identical, leading
to correlations between age and frequency (of words,
wealth, degree or size) which are not fully supported
by empirical observations [24]. In many real systems,
especially in social systems, individuals or elements
are very diverse. In this direction, some models incorporating
heterogeneity in the form of fitness, hidden
variables or ranking have been proposed [25-29].
However, there is rather little empirical work showing
how intrinsic heterogeneity is distributed and its role in
complex system growth [30, 31]. Based on data collected
on a daily basis on the time evolution of an online 
social system we will characterize the heterogeneity 
of the groups and identify the heterogeneity and the 
distributed birth dates as key players explaining the 
heavy-tailed distribution of group sizes and the apparent 
proportional growth of groups to their size. 

We study an online community called Flickr [32], where
members can create and join groups. The groups in 
Flickr are mainly used to collaboratively post photos as- 
sociated with the theme of the group. We will consider 
each group as an element of the system characterized by 
the number of members belonging to the group (group 
size). We have collected two datasets containing in to- 
tal over 260,000 member-created groups in Flickr, which 
accounted for over 65% of all public groups existing in 
Flickr. The first dataset has high temporal resolution 
and a wide time window. It contains 9,503 groups tracked 
for 350 days, between June 5, 2008 and May 20, 2009, 
by the publicly accessible external service called GroupTrackr [33].
The service tracked the number of members of the groups on a daily
basis. The second dataset has a shorter time window and minimal
temporal resolution,
but it covers a larger number of groups. It contains over 
260,000 public groups for which we gathered information 
on the number of members, collected in two snapshots 
on December 18, 2009 and January 29, 2010. For these 
groups we also gathered estimated information on their 
birth date. As an estimate of the group birth date we
consider the time when the first photo was posted to the 
group pool, as the first photo is normally posted to the 
pool soon after the group creation. The oldest groups in 
our dataset date back to July 16, 2004. 



II. GROUPS' GROWTH IN FLICKR 

We first analyze the time evolution of groups. In
Fig. 1(a) we show how typical groups grow in number of
members on a daily basis during the period of one year.
As a first approach, a linear growth captures the individual
trend (despite evident deviations in the form of
sudden jumps). We have performed a linear regression of
the time evolution of the sizes of 9,503 groups over the period of
almost one year. For about half of these groups the coefficient
of determination $R^2$ has a value over 0.95, and more
than 80% of the groups larger than 1000 members have $R^2$ higher
than 0.95. The difference comes from the fact that the
larger groups are affected less by the fluctuations of size.

FIG. 1. (Color online) Characterizing the time evolution of online groups. (a) Time evolution of the group size for a representative sample of small and large groups. (b) Distribution of groups' growth $a$ (open circles) with fitted log-normal distribution (line). The growth per day $a$ is estimated based on growth over 6 weeks. (c) Distribution of group ages.

Aggregated residual plots do not show any clear trend
deviating from our linear model. The time series cover
a considerable part of the average lifespan of the groups.
Thus, we consider that groups grow linearly in time: the
size $g_i$ of group $i$ evolves as

$$g_i = 1 + a_i\,(t - t_i^0) = 1 + a_i \tau_i , \tag{1}$$

where $a_i$ is the growth per unit of time, $t_i^0$ is the birth
date and $\tau_i$ is the current age of group $i$. We estimate
the two parameters for 260,000 groups. The growth $a_i$ for
each group $i$ is calculated as the change of its size during
6 weeks, divided by the number of days. A log-normal distribution
provides the best fit to the distribution of growth values $a$ (Fig. 1(b))
with average $\mu = \langle \ln a \rangle = -3.62$ and standard deviation
$\sigma \approx 1.57$. Finally, we estimated the current ages of all
groups, finding that the number of groups created daily
has been growing (almost linearly) in time (Fig. 1(c)).
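
The estimation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `daily_growth` and `fit_lognormal` are hypothetical, the 42-day interval is the spacing between the two snapshots, and fitting the log-normal through the mean and standard deviation of $\ln a$ (for groups with positive growth only) is an assumption about how such a fit could be done. The synthetic snapshots exist only to check that the quoted parameters are recovered.

```python
import numpy as np

def daily_growth(size_first, size_second, n_days=42):
    """Growth per day between two snapshots taken n_days apart."""
    first = np.asarray(size_first, dtype=float)
    second = np.asarray(size_second, dtype=float)
    return (second - first) / n_days

def fit_lognormal(growth):
    """Return (mu, sigma) of ln(a), using only groups with a > 0."""
    growth = np.asarray(growth, dtype=float)
    logs = np.log(growth[growth > 0])  # ln(a) is undefined for a <= 0
    return logs.mean(), logs.std(ddof=1)

# Synthetic check: draw growth values with the parameters quoted in the text
rng = np.random.default_rng(0)
a_true = rng.lognormal(mean=-3.62, sigma=1.57, size=200_000)
size_dec = np.ones_like(a_true)      # hypothetical first snapshot
size_jan = size_dec + 42 * a_true    # hypothetical second snapshot
mu_hat, sigma_hat = fit_lognormal(daily_growth(size_dec, size_jan))
```

Groups that did not grow or that shrank between the snapshots are dropped by this sketch; how such groups were treated in the original analysis is not stated in the text.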



III. LINEAR GROWTH MODEL WITH 
HETEROGENEOUS BIRTH AND GROWTH 

Based on those findings we propose a minimal model of
the time evolution of group sizes in Flickr, a linear growth
model with heterogeneous birth and growth, which in
short we will refer to as the heterogeneous linear growth
model. The model proceeds as follows. At each time step
$t$: (i) new groups are created in the system. The number
of groups created in each time step increases linearly
with $t$. Each newly created group $i$ starts with one member
and is assigned its own growth value $a_i$, drawn
from a log-normal distribution. The growth value $a_i$ remains
unchanged for the whole simulation; (ii) the size of each
group $i$ is increased by $a_i$.

We have run numerical simulations of the heterogeneous
linear growth model where each time step of the
simulation corresponds to a single day. We have simulated
1959 days in Flickr, from the moment when the
first group from our dataset appeared. As a result of
the numerical simulations we obtain the daily evolution
of the sizes of over 260,000 artificial groups. The distribution
of the final sizes of the groups reproduces the observed
distribution with good agreement (Fig. 2(a)).
As can be seen from Fig. 2(a), there is a small divergence
for large group sizes, which could be explained by
deviations (mostly for small groups) from the linear
growth assumption. The strong fluctuations in the
time evolution of the sizes of small groups (see the
jumps in Fig. 1) lead to an 'apparent' growth larger than
the real one, and therefore to an overestimation of
their growth $a$; as a consequence, the model displays
a larger number of big groups than the real system.
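
A simulation of this kind can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `simulate_group_sizes`, `ccdf` and `births_slope` are hypothetical, the slope of the daily birth count is chosen only so that roughly 260,000 groups appear in 1959 days, and the log-normal parameters are the fitted values quoted in Sec. II. Because each group's growth value is constant, adding it once per day is equivalent to evaluating Eq. (1) at the final time, which is what the sketch does.

```python
import numpy as np

def simulate_group_sizes(n_days=1959, births_slope=0.1354,
                         mu=-3.62, sigma=1.57, seed=0):
    """Final sizes and ages of groups in the heterogeneous linear growth model."""
    rng = np.random.default_rng(seed)
    days = np.arange(1, n_days + 1)
    # (i) the number of groups born on each day increases linearly with time
    births = np.floor(births_slope * days).astype(int)
    birth_day = np.repeat(days, births)
    age = n_days - birth_day
    # each group draws a fixed growth value from a log-normal distribution
    growth = rng.lognormal(mean=mu, sigma=sigma, size=birth_day.size)
    # (ii) adding `growth` once per day for `age` days gives 1 + growth * age
    sizes = 1.0 + growth * age
    return sizes, age

def ccdf(values):
    """Sorted values and the fraction of samples >= each value."""
    x = np.sort(values)
    frac = 1.0 - np.arange(x.size) / x.size
    return x, frac

sizes, age = simulate_group_sizes()
x, frac = ccdf(sizes)  # complementary cumulative distribution of final sizes
```

Plotting `frac` against `x` on doubly logarithmic axes gives the kind of curve that can be compared with an empirical size distribution.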

The average growth of groups of the same size, $\langle a|g\rangle$,
shows that bigger groups grow faster (Fig. 2(b)), both for
the real data and the model, in accordance with
Gibrat's law: $\langle a|g\rangle \propto g$. This result is obtained even
though the microscopic rules of the model do not implement
the rich-gets-richer principle. The average growth
is an average over all groups of a given size, each of them
growing linearly. Due to the heterogeneity and the linear
growth, at a given time larger groups consist of old
groups that grow slowly and younger groups that grow
faster. Thus, the observation of preferential growth for
groups of the same size does not reflect in this case an
underlying rich-gets-richer principle, but it is a consequence
of the competition of groups with different growth values
and ages.

FIG. 2. (Color online) The heterogeneous linear growth model vs. real data. (a) Complementary cumulative distribution function of group sizes for the real data (circles), the heterogeneous linear growth model (filled triangles) and its analytical solution (solid line). (b) Average daily growth as a function of the initial size of the groups, estimated for the period of 6 weeks and averaged over all groups of a given initial size, for: the real data (circles), the model (triangles) and its numerical solution (line). The dashed line corresponds to the linear behavior $\langle a|g\rangle \sim g$.

The statistical properties of the model can be estimated
analytically. From the definition, the average
growth of groups of the same size is given by:

$$\langle a|g\rangle = \frac{\int_{(g-1)/t}^{\infty} a\, P_{ag}(a,g)\, da}{\int_{(g-1)/t}^{\infty} P_{ag}(a,g)\, da} , \tag{2}$$

where $P_{ag}(a,g)$ is the joint probability of having a group
of size $g$ and growth rate $a$, and $\int P_{ag}(a,g)\, da\, dg = 1$.
The lower limit of the integral is given by Eq. (1) and
depends on $g$, and the maximum value of $\tau$ is limited to
$t$, if the first group was created at time $t = 0$. We transform
Eq. (2) by replacing the joint probability $P_{ag}(a,g)$
by $P_{a\tau}(a,\tau)$ and making the assumption that $\tau$ and
$a$ are independent random variables:

$$\langle a|g\rangle = \frac{\int_{(g-1)/t}^{\infty} a\, P_a(a)\, p_\tau(\tau(a,g))\, \frac{\partial \tau}{\partial g}\, da}{\int_{(g-1)/t}^{\infty} P_a(a)\, p_\tau(\tau(a,g))\, \frac{\partial \tau}{\partial g}\, da} . \tag{3}$$

The numerical solution of Eq. (3) for log-normal $P_a$ and
linear $p_\tau$ is plotted in Fig. 2(b). Similarly, the distribution
of group sizes,

$$P_g(g) = \int P_{g\tau}(g,\tau)\, d\tau \tag{4}$$

$$P_g(g) = \int P_a(a(g,\tau))\, p_\tau(\tau)\, \frac{\partial a}{\partial g}\, d\tau , \tag{5}$$

is plotted in Fig. 2(a). As one can see, the solutions for both
the average growth and the size distribution are in good
correspondence with the results of numerical simulations,
which indicates that the assumptions of independent random
variables and linear growth are reasonable.^1



IV. HETEROGENEITY VS. PREFERENTIAL
GROWTH


We have shown that the heterogeneous linear growth
model captures the statistical properties which are commonly
attributed to the preferential growth mechanism.
Thanks to the intrinsic heterogeneity, different
growth patterns are permitted, even if groups have the
same number of members at some point in time. One
can see an example of this in Fig. 1(a), where the size curves of
different groups cross in time, while each group continues
to grow as it grew before the crossing. To make a
direct comparison between the two mechanisms, heterogeneity
vs. preferential growth, we consider the Simon
model [10]. The Simon model was originally proposed
to explain the distribution of word frequencies in
a written text. At every time step, a word is added to
the text: with a given probability $q$ it is a new word;
otherwise, the word is chosen at random from the text,
so the words which appear more frequently are chosen
more often. We have adapted the Simon model to our
system. We have set the parameters to obtain the same
total number of groups and members as in the real case;
also, the number of new groups created in the system
in each time step of the Simon model grows linearly, to
isolate the effect of the heterogeneity. First, in the Simon
model the final size of groups is heavily determined
by their initial size measured one year before (Fig. 3(a)),
thus there is little heterogeneity among the groups, in
contrast to the heterogeneous linear growth model, which
displays a degree of heterogeneity similar to that of
real groups. Second, for the Simon model the correlation
of size and age is strong, while it is weak for real groups
and the heterogeneous linear growth model (Figs. 3(b)-(d)).^2
The wide spread of group sizes corresponds to the high
heterogeneity of groups, which is not captured by the
preferential growth model (as observed in other systems,
for instance in the World Wide Web, where the number
of links to a page is not strongly correlated with
the age of the web page [24]).

FIG. 3. (Color online) Comparison of the Simon and heterogeneous linear growth models vs. real data. (a) Initial and final group sizes over a period of 350 days for the real data (circles), the heterogeneous linear growth model (filled triangles) and the Simon model (diamonds). Each point represents a single group; there are 9,503 points plotted for each set of points. (b-d) Box plots with whiskers at the 9th/91st percentile of the final size of groups as a function of their age at the time of the measurement for 260,000 groups for (b) the real data, (c) the heterogeneous linear growth model, (d) the Simon model.

^1 Equations (3) and (5) are easy to solve if $a$ and $\tau$ are independent
random variables and $P_a$ is a power-law distribution. In such a
case one can show that $\langle a|g\rangle \propto g$ and that $P_g(g)$ is a power-law
as well.
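
The basic Simon process described in this section can be sketched as below, with groups in place of words and members in place of word occurrences. This is a simplified illustration under stated assumptions, not the adapted model used for the comparison: the function name `simon_model` and the parameter `n_events` are hypothetical, the run is far smaller than the real system, and the linearly growing number of new groups per time step is not reproduced. Choosing a uniformly random entry of the membership list selects a group with probability proportional to its size, which is the rich-gets-richer step.

```python
import random

def simon_model(n_events, q, seed=0):
    """Group sizes after n_events steps of a basic Simon process."""
    rng = random.Random(seed)
    sizes = [1]        # group 0 starts with a single member
    membership = [0]   # one entry per member, holding that member's group index
    for _ in range(n_events):
        if rng.random() < q:
            # with probability q a new group of size one is created
            sizes.append(1)
            membership.append(len(sizes) - 1)
        else:
            # otherwise a new member joins the group of a randomly chosen
            # existing member, i.e. a group chosen proportionally to its size
            g = membership[rng.randrange(len(membership))]
            sizes[g] += 1
            membership.append(g)
    return sizes

sizes = simon_model(n_events=20_000, q=0.014)
```

In this process a group's expected growth depends only on its current size, so groups of equal size are statistically identical, in contrast to the heterogeneous linear growth model.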



V. DISCUSSION 

In summary, we have proposed a simple growth model 
of heterogeneous elements with associated growing coun- 
ters, based on the findings for a social system in an online 
community. We found that the model captures many of 
the features of the real system of online groups, namely 
the heavy-tailed distribution of group sizes, the average 
growth proportional to the current size of groups and 
the weak correlation between the age and the size of 
groups. Furthermore we made a direct comparison of 
the heterogeneous linear growth model with a preferen- 
tial growth model and showed the similarities and the 
differences between these models. In the heterogeneous 
linear growth model the heavy-tailed distribution of fi- 
nal sizes of elements does not emerge from the growth 
process itself (e.g., rich-gets-richer principle), but from 
the intrinsic heterogeneity of elements which take part in 
this growth process. This certainly does not answer the
question of why some groups grow faster than others,
as we do not yet understand what factors influence the
fitness of the groups. However, it points out that it does
not have to be due to the fact that one group is bigger
than the other, as in preferential attachment models. The
simplicity of our approach suggests that the characteri- 
zation of the heterogeneity may play an important role in 
understanding the origin of broad distributions and the 
time evolution of many real systems. 

^2 In the heterogeneous linear growth model the average size of
groups of a given age is $\langle g|\tau\rangle = 1 + \tau \exp(\mu + \sigma^2/2)$, where $\mu$ and
$\sigma$ are parameters of the log-normal distribution. In the Simon
model, it is given by $\langle g|\tau\rangle = ($…$)$, where $T$ is the
age of the system, $m$ controls the number of new users introduced
into the system in each time step ($mT$), and $q$ is the probability
of new group creation within the model (in our case $T = 1959$,
$m = 10$ and $q = 0.014$).